Skip to content

fix(sync): prevent concurrent-session rollbacks from destroying writes on in-memory SQLite#989

Merged
phernandez merged 2 commits into
mainfrom
fix/940-batch-relation-resolution-race
Jun 12, 2026
Merged

fix(sync): prevent concurrent-session rollbacks from destroying writes on in-memory SQLite#989
phernandez merged 2 commits into
mainfrom
fix/940-batch-relation-resolution-race

Conversation

@phernandez

Copy link
Copy Markdown
Member

Failure 1: test_sync_entity_circular_relations — lost relation row (the important one)

Root cause

The in-memory SQLite URL (sqlite+aiosqlite://) falls back to SQLAlchemy's StaticPool, which hands the same DBAPI connection to every concurrently checked-out session. All concurrent asyncio tasks therefore share one SQLite transaction scope:

  • a COMMIT by one session commits every sibling task's half-done statements;
  • a ROLLBACK by one session silently destroys every sibling task's executed-but-uncommitted statements.

Rollbacks are routine, not exceptional: the pool's reset-on-return issues a real DBAPI ROLLBACK at every connection checkin. A probe counting DBAPI calls during one sync of the two-file circular-relations scenario measured 40 commits and 41 rollbacks on the shared connection. scoped_session's exception handler is a second rollback source (any per-file error swallowed by _run_bounded).

During _index_changed_filesBatchIndexer.index_files, per-file upserts run concurrently (asyncio.gather bounded by index_entity_max_concurrent). The losing interleaving:

  1. task A's session finishes: DBAPI COMMIT (separate aiosqlite roundtrip);
  2. task B's add_all_ignore_duplicates executes INSERT INTO relation (B→A, in the gap between A's commit and A's checkin — both are independent await points);
  3. task A's connection checkin fires reset-on-return ROLLBACKB's INSERT is erased;
  4. task B "commits" an empty transaction. No error anywhere; sync reports success.

The final sync-level resolve_relations() pass cannot catch this: "pending" means a relation row with to_id IS NULL, and the destroyed INSERT leaves no row at all. Hence len(entity_b.outgoing_relations) == 0. #938 exposed the race by removing the multi-second per-entity embedding work that previously kept the upsert tasks from interleaving tightly. The window is microseconds, which is why it fired once on a loaded 2-core runner and never locally (30 pytest runs + 500 in-process iterations clean).

Production is not affected: file-based SQLite uses per-session pooled connections (isolated transactions, WAL) and Postgres uses NullPool. DatabaseType.MEMORY is only used by the test environment via engine_session_factory — but the unsafe engine configuration is product code in db.py.

Fix

_create_sqlite_engine now gives MEMORY engines a single-connection blocking pool (AsyncAdaptedQueuePool, pool_size=1, max_overflow=0). The lone connection keeps the in-memory database alive for the engine's lifetime (the reason StaticPool was chosen), while the blocking checkout serializes sessions at transaction granularity — restoring the isolation the repositories assume. No sleeps, no retries; nested-session misuse now fails loudly (pool checkout timeout) instead of silently sharing transactions, and one such test (test_delete_entity_with_relations) was un-nested.

Evidence

  • New regression test tests/db/test_memory_db_session_isolation.py fails deterministically on StaticPool (relation count 0 after a sibling session rolls back) and passes with the serialized pool.
  • Full unit suite green after the change (2946 passed); 200 in-process iterations of the circular-relations sync scenario green.

Failure 2: test_mcp_sse_forces_local — KeyError: 'FORCE_LOCAL' (Python 3.14 leg)

Root cause

The stdio variants of TestMcpCommandRouting invoke bm mcp, whose command body starts a real background auto-update daemon thread before mcp_server.run (the tests only mock run). That thread hits PyPI (5s timeout) and then rewrites config.json (auto_update_last_checked_at) with an in-place truncating Path.write_text. When that write lands while a later test's CLI invocation reads config.json (app callback → CliContainer.create()load_config()), the reader sees empty/partial JSON and load_config() raises SystemExit. CliRunner.invoke swallows it, the mocked mcp_server.run never executes, and the test fails with KeyError: 'FORCE_LOCAL' — exactly the CI shape. Nothing is 3.14-specific; the leg only shifted timing (the test also never inspected result, discarding the evidence).

The torn write is also a real production race: the MCP stdio server re-reads config.json on mtime change and exits the process on invalid JSON if it races a CLI save.

Fix

  • save_basic_memory_config writes a per-process/per-thread sibling temp file and publishes via os.replace — readers always see a complete document. Regression test fault-injects an interrupted write; fails against the old in-place write.
  • The stdio routing tests stub run_auto_update (no PyPI call / config write leaking across tests), and all four transport tests assert result.exit_code == 0 so future pre-run failures surface the real error.

Verification

  • uv run pytest tests — 2946 passed, 33 skipped
  • uv run pytest test-int/cli — 69 passed
  • Both regression tests proven to fail with their fix stashed and pass with it restored
  • uv run ruff check . (changed files clean; pre-existing hermes findings untouched), uv run ruff format ., uv run ty check src tests test-int — clean

🤖 Generated with Claude Code

Fixes #940

phernandez and others added 2 commits June 12, 2026 00:24
…s cannot destroy writes

Root cause of the test_sync_entity_circular_relations CI failure (#940,
len(entity_b.outgoing_relations) == 0): the in-memory SQLite URL
(sqlite+aiosqlite://) falls back to SQLAlchemy's StaticPool, which hands the
same DBAPI connection to every concurrently checked-out session. Concurrent
asyncio tasks therefore share one SQLite transaction scope. A rollback issued
through one session — scoped_session's exception handler, or the pool's
reset-on-return at connection checkin (~40 real ROLLBACKs per sync, measured)
— also rolls back any other task's executed-but-uncommitted statements.

During batch indexing, per-file tasks run concurrently (asyncio.gather bounded
by index_entity_max_concurrent). If a sibling session's checkin ROLLBACK lands
between one task's relation INSERT and its COMMIT, the relation row is silently
erased: no error is raised, the sync reports success. The final sync-level
resolve_relations() pass cannot heal this because it only re-queries rows with
to_id IS NULL — the destroyed INSERT leaves no row at all.

Fix: give MEMORY-type engines a single-connection AsyncAdaptedQueuePool
(pool_size=1, max_overflow=0). The lone connection keeps the in-memory
database alive for the engine's lifetime (the reason StaticPool was used),
while the blocking checkout serializes sessions at transaction granularity,
restoring the isolation the repositories assume. File-based SQLite and
Postgres engines are unchanged; production never uses MEMORY engines.

The regression test pins the invariant directly: a session that rolls back in
one task must never destroy another task's uncommitted writes. It fails
deterministically against StaticPool and passes with the serialized pool.

tests/repository/test_entity_repository.py held a scoped session open while
calling a repository method that opens its own session — tolerated on a shared
connection, a deadlock under a serialized pool — so the nested call moved out
of the session block.

Refs #940

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
… routing tests

Root cause of the test_mcp_sse_forces_local KeyError('FORCE_LOCAL') flake
(#940, Python 3.14 leg): the stdio variants of the mcp routing tests invoke
`bm mcp`, which starts a real background auto-update daemon thread before
mcp_server.run (the tests only mock run). That thread hits PyPI and then
rewrites config.json (auto_update_last_checked_at) with an in-place
Path.write_text, which truncates the file before writing. If that write lands
while a later test's CLI invocation is reading config.json (the app callback's
CliContainer.create()), load_config() sees empty/partial JSON and raises
SystemExit. CliRunner.invoke swallows it, the mocked mcp_server.run never
executes, and the test dies with KeyError on env_at_run['FORCE_LOCAL'] —
exactly the observed CI failure shape. Nothing is 3.14-specific; that leg
only shifted the timing.

The same torn write is user-visible in production: the MCP stdio server
re-reads config.json on mtime change and load_config() exits the process on
invalid JSON if it races a CLI save.

Fix: save_basic_memory_config writes a per-process/per-thread sibling temp
file and publishes it with os.replace, so readers always observe either the
old or the new complete document. The regression test injects an interrupted
write and asserts the published config stays untouched; it fails against the
old in-place write.

Test hardening: the stdio routing tests stub run_auto_update so no PyPI call
or config write leaks across tests, and all four transport tests now assert
result.exit_code == 0 so a future pre-run failure surfaces its real error
instead of a KeyError.

Refs #940

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: phernandez <paul@basicmachines.co>
@phernandez phernandez merged commit b3bdd59 into main Jun 12, 2026
23 checks passed
@phernandez phernandez deleted the fix/940-batch-relation-resolution-race branch June 12, 2026 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Timing-sensitive tests exposed by faster test suite (post-#938): circular-relations batch indexing, mcp sse routing

1 participant